This dataset contains red wine quality measurements. The inputs are objective tests (e.g. pH values) and the output is based on sensory data (the median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent).
Checking the structure and variables of the dataset.
## [1] 1599 13
## 'data.frame': 1599 obs. of 13 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## $ quality.factor : Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 3 3 3 4 3 3 3 5 5 3 ...
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality quality.factor
## Min. :3.000 3: 10
## 1st Qu.:5.000 4: 53
## Median :6.000 5:681
## Mean :5.636 6:638
## 3rd Qu.:6.000 7:199
## Max. :8.000 8: 18
The quality variable was also transformed into an ordered factor (quality.factor).
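A minimal sketch of that conversion, using only the first ten quality scores shown in the structure output above. The data frame name `rw` is taken from the model calls later in this analysis. The explicit `levels = 3:8` argument is an assumption, added so that this small sample has the same six levels as the full data.

```r
# First ten quality scores, copied from the str() output above
rw <- data.frame(quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L))

# Ordered factor with levels 3 < 4 < ... < 8, as in the full dataset.
# levels = 3:8 is set explicitly (assumption) because this sample
# only contains the scores 5, 6 and 7.
rw$quality.factor <- factor(rw$quality, levels = 3:8, ordered = TRUE)

# Inspect the result: an Ord.factor with 6 levels
str(rw$quality.factor)
```

Keeping the original integer column alongside the factor is convenient, because correlations and linear models need the numeric version while grouped plots read better with the factor.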
In this section we can see the distribution of the variables in the dataset.
We can see above that most wines have ratings of 5 or 6, with only a few rated 3 or 8.
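To put a number on this, the counts per rating from the summary output can be turned into proportions. This is only a sketch: the counts are copied by hand from the summary, and the variable names are illustrative.

```r
# Counts per quality level, copied from the summary output above
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Proportion of wines at each rating
quality_share <- quality_counts / sum(quality_counts)
round(quality_share, 3)

# Combined share of wines rated 5 or 6 (1319 of the 1599 wines)
share_5_6 <- sum(quality_share[c("5", "6")])
share_5_6
```

With the full data frame loaded, `table()` on the quality column should give the same counts directly.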
Alcohol content lies between about 8% and 15%.
For residual sugar we removed some high outliers (the summary shows a maximum of 15.5) to get a clearer view of the shape of the distribution.
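One way to drop high outliers before plotting is sketched below. The analysis does not state which cutoff was used, so the 95th percentile here is an assumption. The values are the first ten residual sugar measurements from the structure output, not the full column.

```r
# First ten residual sugar values, copied from the str() output above
residual.sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# Hypothetical cutoff: keep values at or below the 95th percentile
cutoff <- quantile(residual.sugar, 0.95)
trimmed <- residual.sugar[residual.sugar <= cutoff]

# Compare the range before and after trimming
range(residual.sugar)
range(trimmed)
```

The trimmed vector can then be passed to `hist()` or any other plotting function. Trimming is for visualisation only here; the outliers stay in the data used for the summary and the models.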
For citric acid, most values lie between 0 and 0.8.
pH values lie between about 2.5 and 4.
The dataset has 1599 observations of 12 variables: 11 physicochemical inputs and the quality rating. The output above shows 13 columns because it also includes the derived quality.factor.
The features that first come to mind as likely to influence the quality ratings are alcohol, residual.sugar, chlorides and perhaps citric.acid, which is described as adding “freshness” and flavor to wines.
The other features may well have underlying significance; however, as they have no described effect on taste, they are not analysed initially.
Residual sugar has very high outliers. The bulk of the data has quality values of 5 or 6. Citric acid is somewhat evenly distributed between 0 and 0.5. The quality variable was converted into an ordered factor, since its values are fixed, ordered classes.
Here I decided to plot all the variables that were investigated in the first analysis, which I believe are more likely to be correlated with the quality rating.
Across the four plots, only alcohol shows a visible change in distribution and mean between the quality ratings.
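The comparison behind those plots can also be made numerically by summarising alcohol within each quality rating. The sketch below uses only the first ten rows from the structure output, so it shows the mechanics rather than the trend; the upward shift described above only becomes visible on the full 1599 rows.

```r
# First ten rows of alcohol and quality, copied from the str() output above
rw <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Mean and median alcohol content within each quality rating
mean_alcohol <- tapply(rw$alcohol, rw$quality, mean)
median_alcohol <- tapply(rw$alcohol, rw$quality, median)

# One row per rating, one column per statistic
cbind(mean = mean_alcohol, median = median_alcohol)
```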
In this section I decided to plot all the variables together with their correlations, to check whether any other feature is relevant to quality or related to another feature.
This showed relationships between pH and the acidity variables, which makes sense. I then decided to check whether there was anything odd in the strongest of these relationships with pH.
## [1] 0.6717034
## [1] -0.6829782
## [1] -0.5419041
These checks show that the variables are reasonably distributed and correlated.
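Correlations like the three values above can be computed with `cor()`, which uses Pearson's coefficient by default. The sketch below assumes the pairs involved are fixed.acidity, citric.acid and pH; it runs on the first ten rows only, so its results will not match the full-data values printed above.

```r
# First ten rows of the three variables, copied from the str() output above
rw <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid   = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)

# Pairwise Pearson correlations, one pair at a time
cor(rw$fixed.acidity, rw$citric.acid)
cor(rw$fixed.acidity, rw$pH)
cor(rw$citric.acid, rw$pH)

# The same information as a full correlation matrix
cor_matrix <- cor(rw)
round(cor_matrix, 3)
```

`cor.test()` could be used instead of `cor()` when a confidence interval or p-value is also wanted.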
As the previous boxplots suggested, the only variable with a correlation of almost .5 with quality is alcohol, at .476. The second strongest correlation is a negative one, volatile.acidity, at -.391.
Among the other variables, pH shows negative, roughly linear relationships with fixed.acidity and citric.acid. This makes sense, given that pH is a measure of acidity: the lower the pH, the higher the acidity. The quantity of citric.acid also rises with fixed.acidity. These correlations boast the highest values (.672 between fixed.acidity and citric.acid and -.683 between fixed.acidity and pH), as we could see.
Here we check volatile acidity, which the ggpairs plot earlier showed to be comparatively well correlated with quality and which could therefore contribute to a predictive model, together with alcohol against quality. Bar a few outliers, there is a visible tendency. It is an interesting feature, because it measures the “vinegar taste” of the wine, and it is said that the more of it a wine has, the more unpleasant it tastes.
I also decided to plot the sulphates feature, which, compared with the other features, has a good correlation with quality. This is an inorganic compound said not to influence the taste, but we can see a slight correlation in the plot.
We can also see the relationship between the three variables we checked before: pH and the two acidity measures.
With all this said, we can try to build a model with the best variables available.
##
## Calls:
## m1: lm(formula = quality ~ alcohol + volatile.acidity, data = rw)
## m2: lm(formula = quality ~ alcohol + volatile.acidity + sulphates,
## data = rw)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## citric.acid, data = rw)
##
## ==============================================================
## m1 m2 m3
## --------------------------------------------------------------
## (Intercept) 3.095*** 2.611*** 2.646***
## (0.184) (0.196) (0.201)
## alcohol 0.314*** 0.309*** 0.309***
## (0.016) (0.016) (0.016)
## volatile.acidity -1.384*** -1.221*** -1.265***
## (0.095) (0.097) (0.113)
## sulphates 0.679*** 0.696***
## (0.101) (0.103)
## citric.acid -0.079
## (0.104)
## --------------------------------------------------------------
## R-squared 0.317 0.336 0.336
## adj. R-squared 0.316 0.335 0.334
## sigma 0.668 0.659 0.659
## F 370.379 268.912 201.777
## p 0.000 0.000 0.000
## Log-likelihood -1621.814 -1599.384 -1599.093
## Deviance 711.796 692.105 691.852
## AIC 3251.628 3208.768 3210.186
## BIC 3273.136 3235.654 3242.448
## N 1599 1599 1599
## ==============================================================
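We can see that the models do not perform very well: even the best one explains only about a third of the variance (R-squared of 0.336), as the features are not strongly descriptive of the quality ratings. Adding citric.acid in m3 does not help, since its coefficient is not significant and the AIC rises slightly compared with m2.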
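The nested models can be fitted with `lm()` using the formulas shown in the calls above. The sketch below sticks to base R and to the first ten rows from the structure output, so the coefficients will differ from the table; it only illustrates how the models are built and compared. The side-by-side table itself looks like the output of a model-comparison helper, but which one was used is not stated.

```r
# First ten rows of the model variables, copied from the str() output above
rw <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Same formulas as m1 and m2 in the calls above
m1 <- lm(quality ~ alcohol + volatile.acidity, data = rw)
m2 <- lm(quality ~ alcohol + volatile.acidity + sulphates, data = rw)

# R-squared never decreases when a predictor is added
r2 <- c(m1 = summary(m1)$r.squared, m2 = summary(m2)$r.squared)
r2

# AIC penalises the extra parameter, so lower is better
AIC(m1, m2)
```

Comparing AIC or adjusted R-squared rather than plain R-squared is what allows a judgement on whether an extra predictor earns its place.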
The relationship between pH and the acidity variables is clear in the plot above. We could also check volatile acidity, sulphates and alcohol against the quality ratings to confirm the correlations found before. I tried to build a linear model based on some of the variables analysed. The model cannot explain much of the variance in the data, with an R-squared of just .336, which makes it hard to predict quality with the variables we have.
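As a quick consistency check, the adjusted R-squared values in the model table can be recomputed from the reported R-squared, the number of observations and the number of predictors. The inputs below are the rounded values from the table, so the results agree only to about three decimals.

```r
# Values copied from the model table above
r_squared <- c(m1 = 0.317, m2 = 0.336, m3 = 0.336)
n <- 1599                        # observations
p <- c(m1 = 2, m2 = 3, m3 = 4)   # predictors, excluding the intercept

# Adjusted R-squared: penalises each additional predictor
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
round(adj_r_squared, 3)
```

The rounded results match the table (0.316, 0.335, 0.334). m3 has the same R-squared as m2 but a lower adjusted value, which is another sign that citric.acid adds nothing to the model.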
This first plot helped me to verify that alcohol is correlated with the quality score. The median and mean show a clear upward tendency as quality increases. So this is a good feature to consider for a model, compared with all the others.
This is one of the plots in which I could verify one of the strongest relationships in the dataset: the more citric acid, the lower the pH, which makes sense given the components of the wine.
In this one I could look at the three features that best describe wine quality. These three have the best correlations with quality in the dataset, so it makes sense to see them all together. Note that volatile acidity works against quality: its correlation with quality is negative (-.391), as is its coefficient in the models, which fits its description as the process of the wine “turning into vinegar”.
This is my debut exploring data on my own, so it has been a nice first challenge. The structure and values of the dataset are quite simple, so exploring the variables individually was not that difficult. What I found challenging was that the variables were not clearly descriptive of the target: correlations were low for most of them, so the analysis was not as straightforward as I thought it would be. I have grown accustomed to the R language. It is a simple language to work with, and even though my knowledge of it is limited to what I learned in the course, it was comfortable to use during this analysis. For a future analysis, it would be nice to find out whether a combination of the acidity components could be more descriptive of quality, and whether the quality rating could be broken down into more descriptive features of taste, so we could see which characteristics, according to the rater, determined the rating and check more clearly whether they agree with the component data.